Analyzing Application Usage Data

User Searches for SXSW Interactive Sessions

At South by Southwest 2017, Watson Data Platform developer advocates presented a demonstration app that allowed conference attendees to quickly and contextually find those events of greatest interest to them. The app is described in this blog post.

This data science notebook analyzes the log of all user interactions in the application to help understand how it was used, and what topics were of greatest interest to users.

Prerequisites

We're going to use PixieDust to help visualize our data. You can learn more about PixieDust at https://ibm-cds-labs.github.io/pixiedust/. In the following cell we ensure we are running the latest version of PixieDust. Be sure to restart your kernel if instructed to do so.


In [ ]:
!pip install --user --upgrade pixiedust

In [ ]:
import pixiedust
pixiedust.enableJobMonitor()

In [ ]:
from pyspark.sql.functions import explode, lower

Configure database connectivity

We've made our anonymized SXSW log data available at opendata.cloudant.com. If you are loading from your own Cloudant instance, specify the appropriate credentials in the following cell.


In [ ]:
# Enter your Cloudant host name
host = 'opendata.cloudant.com'
# Enter your Cloudant user name
username = ''
# Enter your Cloudant password
password = ''
# Enter your source database name
database = 'sxswlog'

Load documents from the database

Load the documents into an Apache Spark DataFrame.


In [ ]:
# no changes are required to this cell
from pyspark.sql import SparkSession
# obtain SparkSession
sparkSession = SparkSession.builder.getOrCreate()
# load data
if username:
    conversation_df = sparkSession.read.format("com.cloudant.spark").\
                                        option("cloudant.host", host).\
                                        option("cloudant.username", username).\
                                        option("cloudant.password", password).\
                                        load(database)
else:
    conversation_df = sparkSession.read.format("com.cloudant.spark").\
                                        option("cloudant.host", host).\
                                        load(database)

Document Structure

Each document in the database represents a single conversation made with the chatbot. Each conversation includes the user, date, and the steps of the conversation. The steps are stored in an array called dialogs (referring to the dialogs in Watson Conversation that were traversed as part of the conversation). Here is a sample conversation:

"_id": "018885a1fb6cf6dbb49a8e11542e7670", "_rev": "1-02239161bbfcbae37f5e85c43225fd4b", "user": "phoneeb14851fc4c343e1b5dd96c6ed9e3748", "date": 1489109308136, "dialogs": [ { "name": "get_music_topic", "message": "Music", "date": 1489343583979 }, { "name": "search_music_topic", "message": "Brass bands", "date": 1489343600650 } ]

In this particular conversation the user first told the chatbot they would like to search for music gigs by sending the message "Music" to the chatbot. The user then asked the chatbot to search for "Brass bands".
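As a plain-Python sketch of reading this structure (using an abbreviated copy of the sample document above), each step pairs the user's message with the name of the dialog it triggered:

```python
# Abbreviated copy of the sample conversation document shown above.
conversation = {
    "user": "phoneeb14851fc4c343e1b5dd96c6ed9e3748",
    "date": 1489109308136,
    "dialogs": [
        {"name": "get_music_topic", "message": "Music", "date": 1489343583979},
        {"name": "search_music_topic", "message": "Brass bands", "date": 1489343600650},
    ],
}

# Each step pairs the dialog (action) name with the user's message.
steps = [(d["name"], d["message"]) for d in conversation["dialogs"]]
print(steps)
# -> [('get_music_topic', 'Music'), ('search_music_topic', 'Brass bands')]
```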

In the following cell we print the schema to confirm the structure of the documents.


In [ ]:
conversation_df.printSchema()

How many conversations were there?

Let's start by showing how many conversations the chatbot recorded at SXSW:


In [ ]:
conversation_df.count()

How many users installed the chatbot?

At SXSW we demonstrated the chatbot on a laptop and display, and also gave users the ability to run the chatbot from their own phones, interacting with it via SMS or a mobile-optimized version of the web app. When running from the laptop we stored the user value as "web" plus a UUID; when running from a user's phone we stored the user value as "phone" plus a UUID. We are most interested in the conversations had by users who installed the chatbot on their phones, so here we filter down to only those conversations and count the distinct users:
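The filter and distinct-count in the next cell can be sketched in plain Python (with hypothetical user values) like this:

```python
# Hypothetical user values: "web" + uuid for the laptop demo,
# "phone" + uuid for users running the chatbot from their phones.
users = ["web-d1a4", "phone-eb14", "phone-9c27", "phone-eb14", "web-33f0"]

# Keep only "phone"-prefixed users (the LIKE "phone%" filter),
# then count the distinct ones (the .distinct().count()).
phone_users = {u for u in users if u.startswith("phone")}
print(len(phone_users))
# -> 2
```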


In [ ]:
phone_conversation_df = conversation_df.filter('user LIKE "phone%"')
phone_conversation_df.select('user').distinct().count()

How many conversations were made by users from their phones?

This question counts any conversation that traversed more than one dialog. If the user simply said "hi" and never conversed with the chatbot after that, we don't want to count it as a conversation.


In [ ]:
phone_conversation_df = phone_conversation_df.filter('size(dialogs) > 1')
phone_conversation_df.count()

Flatten the Cloudant JSON document structure

Each dialog contains a message field which contains the message sent by the user, and a name field which represents the action performed by the system based on the message sent by the user and the current dialog in the conversation as managed by Watson Conversation. For example, the name search_topic maps to the action of searching for Interactive sessions. The name search_film maps to the action of searching for film screenings. We want to do some analysis on specific actions and the messages associated with those actions, so in the next cell we convert each row (which has the dialog array) into multiple rows - one for each dialog. This will make it easier for us to filter and aggregate based on the message and name fields in the dialogs.
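To see what `explode` does, here is a minimal plain-Python equivalent (with hypothetical data) of flattening each conversation's dialogs array into one row per dialog; the real work in the next cell is done by Spark:

```python
# Plain-Python illustration of what Spark's explode does here:
# each conversation row (with a dialogs array) becomes one row per dialog.
conversations = [
    {"user": "phone-a", "dialogs": [
        {"name": "get_music_topic", "message": "Music", "date": 1},
        {"name": "search_music_topic", "message": "Brass bands", "date": 2},
    ]},
    {"user": "phone-b", "dialogs": [
        {"name": "search_topic", "message": "AI", "date": 3},
    ]},
]

rows = [
    # lower() mirrors the lower(...) applied to the message column below
    {"date": d["date"], "message": d["message"].lower(), "name": d["name"]}
    for c in conversations
    for d in c["dialogs"]
]
print(len(rows))
# -> 3
```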


In [ ]:
phone_dialog_df = phone_conversation_df.select(explode(phone_conversation_df.dialogs).alias("dialog"))
phone_dialog_df = phone_dialog_df.select("dialog.date", 
                                         lower(phone_dialog_df.dialog.message).alias("message"), 
                                         "dialog.name")
phone_dialog_df.printSchema()

Display Dialogs in PixieDust

Below we display each dialog in a PixieDust table. You can see the date, message (the message the user sent to the chatbot), and name (the name of the action performed).


In [ ]:
display(phone_dialog_df)

How many searches for SXSW Interactive sessions?

As mentioned earlier, search_topic maps to the action of searching for SXSW Interactive sessions. Here we create a DataFrame with only those search actions:


In [ ]:
interactive_dialog_df = phone_dialog_df.filter(phone_dialog_df.name == 'search_topic')
interactive_dialog_df.count()

Next we group by message, the message sent by the user. In this case it essentially represents the search term the user entered to find Interactive sessions. Here we aggregate and display the search terms across all users:
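As a rough plain-Python analogue (with hypothetical search messages), the `groupBy('message').count().orderBy('count', ascending=False)` in the next cell behaves like `collections.Counter`:

```python
from collections import Counter

# Hypothetical search messages, standing in for the `message` column.
messages = ["ai", "vr", "ai", "design", "ai", "vr"]

# Count occurrences per message and order by count, descending.
counts = Counter(messages).most_common()
print(counts)
# -> [('ai', 3), ('vr', 2), ('design', 1)]
```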


In [ ]:
interactive_dialog_by_message_df = interactive_dialog_df.groupBy('message').count().orderBy('count', ascending=False)
display(interactive_dialog_by_message_df)

Display PixieDust Bar Chart

Next, we'll take the top 20 most popular search terms and display them in a bar chart using PixieDust:


In [ ]:
display(interactive_dialog_by_message_df.limit(20))

Display PixieDust Pie Chart

Finally, we'll take the top 10 most popular search terms and display them in a pie chart using PixieDust:


In [ ]:
display(interactive_dialog_by_message_df.limit(10))

Conclusions

It's important to remember that our user population does not represent SXSW attendees in general, because the only people introduced to our app were those who chose to visit IBM's installation that week and also chose to stop by our booth there. What we can say is that SXSW attendees interested in IBM technology innovation have an overwhelming interest in artificial intelligence and virtual reality, and a lesser but significant interest in design, data, health, and social media.